ABSTRACT. We consider the problem of cardinality estimation in data stream applications, focusing on two techniques that use pseudo-random variates to form low-dimensional data sketches. We apply conventional statistical methods to compare algorithms based on storing either selected order statistics or random projections. We derive estimators of the cardinality in both cases and show that the maximal-term estimator is recursively computable and has exponentially decreasing error bounds. Furthermore, we show that the estimators have comparable asymptotic efficiency, and explain this result by demonstrating an unexpected connection between the two approaches.

Key words: data sketching, data stream, hash function, maximum likelihood estimation, asymptotic 
relative efficiency, space complexity, stable distribution, tail bounds. 

1 Introduction 

High-throughput, transiently observed data streams pose novel and challenging problems for computer scientists and statisticians (Muthukrishnan, 2005; Aggarwal, 2007). Advances in science and technology are continually expanding both the size of data sets available for analysis and the rate of data acquisition; examples include increasingly heavy Internet traffic on routers (Akella et al., 2003; Cormode and Muthukrishnan, 2005b), high frequency financial transactions, and commercial database applications (Whang et al., 1990).

The on-line approximation of properties of data streams, such as cardinality, frequency moments, quantiles, and empirical entropy, is of great interest (Cormode and Muthukrishnan, 2005a; Harvey et al., 2008). The goal is to construct and maintain sub-linear representations of the data from which target properties can be inferred with high efficiency (Aggarwal, 2007). Data stream algorithms typically allow only one pass over the data, i.e., data are observed, processed to update the representation, and then discarded. By 'efficient' with respect to the inference procedure, we mean that estimators are accurate with high probability, while with respect to the handling of data, we mean that the algorithm has fast processing and updating time per data element, uses low storage, and is insensitive to the order of arrival of data.

This article focuses on the problem of estimating the number of distinct items in a data stream when storage constraints preclude the possibility of maintaining a comprehensive list of previously observed items. The number of distinct items, or cardinality, can for example refer to pairs of source-destination IP addresses, observed within a given time window of Internet traffic, monitored for the purpose of anomaly detection, e.g., denial-of-service attacks on the network (Giroire, 2009). There is a surprisingly long history of work on cardinality estimation in the computer science literature, starting from the pioneering work of Flajolet and Martin (1985) and developed in isolation from mainstream statistical research. Our purpose is to re-analyse these algorithms in 'traditional' statistical terms. We will concentrate on sketching algorithms that exploit hash functions to record meaningful information, either by storing order statistics or by random projections.

The paper is organised as follows. In Section 2 we define terms, such as hash function and hashing, and give a brief and selective history of cardinality estimation algorithms. We then investigate two types of algorithms from a conventional statistical viewpoint, deriving maximum likelihood estimators (MLEs) for methods based on order statistics in Section 3 and random projections in Section 4. For order statistic methods, we show that the choice of sampling distribution is immaterial when sampling from a continuous distribution but that substantial savings in storage can be achieved by using samples from the geometric distribution without significant reduction in the asymptotic relative efficiency. We also show that these estimators are recursively computable with exponentially decreasing error bounds. We then propose an approximate estimator for projection methods using $\alpha$-stable distributions, with $\alpha$ close to zero. Finally, in Section 5 we compare the two methods and find unexpectedly that, in a certain sense, they are essentially equivalent.



2 Definitions and history 



We define a discrete data stream to be a transiently observed sequence of data elements with types drawn from a countable, possibly infinite, set $\mathcal{I}$. At discrete time points $t = 1, \ldots, T$, a pair of the form $(i_t, d_t)$ is observed, where $i_t \in \mathcal{I}$ is the type of the data element, and $d_t$ is an integer-valued quantity. Let $\mathcal{I}_T$ be the set of distinct data types observed by time $T$.

A basic goal in data stream analysis is to obtain information about the collection $a(T) = \{a_i(T), i \in \mathcal{I}_T\}$, where $a_i(T) = \sum_{t=1}^{T} d_t \mathbb{1}(i_t = i)$ is the cumulative quantity of type $i$ at time $T$. When there is no possibility of confusion, we write $a$ and $a_i$ for $a(T)$ and $a_i(T)$, respectively. Our concern will be primarily with the special case when $d_t > 0$, $\forall t$; the cash register case in the terminology of Cormode et al. (2003). When $d_t = 1$, $\forall t$, we use the expression simple data stream. Many summary statistics of interest are functions of $a$, e.g., $c = \sum_{i \in \mathcal{I}} \mathbb{1}(a_i(T) > 0)$, the cardinality of the set $\mathcal{I}_T$ in the cash register case. Recall that we are assuming that storage constraints make it impossible to know $a$ precisely.

Hashing (Knuth, 1998) is a basic tool used in processing data, where the type of data element is identified by a complicated label. Hashing was originally designed to speed up table lookup for the purpose of item retrieval or for identifying similar items. For example, suppose that data elements are records of company employees, uniquely identified by complicated labels, that must be stored in a table. A hash function can be designed to map the label to an integer value in a given range, called the hash value, indexing the location in the table where the corresponding employee record is stored. See Press et al. (2007) for algorithms to construct hash functions. Given the hash function and a label, the corresponding record is easily accessible, for example for updating.

In general, a hash function $h : \mathcal{I} \to \{1, 2, \ldots, L\}$ is a deterministic function of the input in $\mathcal{I}$ that has low collision probability, i.e., $P[h(i) = h(j)]$ is small for $i \neq j$, where a collision occurs if two or more different inputs are mapped to the same hash value (Knuth, 1998). For our purposes we can think of a hash function as the mapping between the seed of a random number generator and the first element in the sequence of computer generated pseudo-random numbers, usually uniformly distributed over some range. A collection of independent hash functions $(h_1, \ldots, h_m)$ then corresponds to the $m$ individual mappings from the seed to the first $m$ elements of a pseudo-random sequence. This method of constructing a hash function mapping to pseudo-random numbers having a given distribution is known as the method of seeding.
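The method of seeding can be sketched in a few lines. This is only an illustration under stated assumptions: Python's standard `random.Random` stands in for the random number generator, the string form of the label is used as the seed, and the function name `seeded_hashes` is hypothetical.

```python
import random

def seeded_hashes(label, m):
    """Return m pseudo-random U(0, 1) values determined entirely by `label`.

    Assumption: the label's string form is the seed, so restarting the
    generator from the same label reproduces the same sequence h_1, ..., h_m.
    """
    rng = random.Random(str(label))
    return [rng.random() for _ in range(m)]

# The same label always reproduces the same m values; a different label
# gives an unrelated sequence.
first = seeded_hashes("alice@example.com", 3)
again = seeded_hashes("alice@example.com", 3)
other = seeded_hashes("bob@example.com", 3)
print(first == again, first == other)
```

Because the values are a deterministic function of the label, nothing about previously seen labels has to be stored in order to recognise a repeat.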



2.1 Probabilistic counting 



Flajolet and Martin (1985) introduce the idea of independently hashing each element $i \in \mathcal{I}_T$ to a long string of pseudo-random bits, uniformly distributed over a finite range. Let $\rho(i)$ denote the rank of the first bit 1 in $h(i)$. The algorithm stores and updates a bitmap table of all the values of $\rho$ observed, and returns an asymptotically unbiased estimate of the cardinality based on the quantity $\max\{j : \{1, \ldots, j\} \subseteq \{\rho(i), i \in \mathcal{I}_T\}\}$. The LogLog counting algorithm (Durand and Flajolet, 2003; Flajolet, 2004) offers an improvement of approximately a factor of 3 in terms of storage requirements (for given accuracy), by estimating the cardinality from the summary statistic $\max_{i \in \mathcal{I}_T} \rho(i)$, avoiding the need for the bitmap table. Instead of estimating the cardinality from bit patterns, Giroire (2009) hashes the data types uniformly to pseudo-random variables in (0, 1), stores order statistics of hash values falling in disjoint subintervals covering this range, and averages cardinality estimates over these subintervals. This approach is called stochastic averaging and was introduced by Flajolet and Martin (1985). The estimators of Giroire (2009) based on order statistics are comparable in terms of precision to those of Flajolet and Martin (1985), Durand and Flajolet (2003) and Flajolet (2004).
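The two summary statistics mentioned above, the bitmap of observed ranks and the single maximal rank, can be illustrated as follows. This is a sketch only and not the published algorithms: SHA-256 is an assumed stand-in for the hash function, the 32-bit width is arbitrary, the names `rank_of_first_one`, `rho` and `summarise` are hypothetical, and no bias-correction constants or estimators are included.

```python
import hashlib

def rank_of_first_one(value, n_bits):
    """1-based position, counted from the left, of the first 1 in an n_bits-wide word."""
    if value == 0:
        return n_bits + 1                     # no 1-bit present at all
    return n_bits - value.bit_length() + 1

def rho(label, n_bits=32):
    """Rank of the first 1-bit in an n_bits hash of the label (SHA-256 is an assumption)."""
    digest = hashlib.sha256(str(label).encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (256 - n_bits)   # keep the leading n_bits
    return rank_of_first_one(value, n_bits)

def summarise(labels, n_bits=32):
    """Return (bitmap of all ranks observed, maximal rank observed)."""
    bitmap, max_rank = 0, 0
    for label in labels:
        r = rho(label, n_bits)
        bitmap |= 1 << (r - 1)                # bitmap table of every rank seen
        max_rank = max(max_rank, r)           # the single maximal-rank summary
    return bitmap, max_rank

stream = ["a", "b", "a", "c", "b", "a"]
print(summarise(stream) == summarise(set(stream)))
```

Both summaries depend only on the set of distinct labels, which is what makes them usable for cardinality estimation on streams with repeats.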



Projection methods for $\ell_\alpha$ norm estimation with streaming data are described in Indyk (2006) for $\alpha \in \{1, 2\}$, and references therein. The idea is to hash distinct data types $i_t$ to independent copies of $\alpha$-stable random variables, and store weighted linear combinations of the hash values. Exploiting properties of the stable law, Cormode et al. (2003) approximate the cardinality using estimates of $\ell_\alpha$ with $\alpha$ close to zero.

Probabilistic counting algorithms
Peter Clifford and Ioana A. Cosma

The seminal paper of Alon et al. (1999) is the first attempt at obtaining tight lower bounds on the space complexity of approximating the cardinality of a simple data stream. Bar-Yossef et al. (2002) present the best $(\epsilon, \delta)$-approximation of the cardinality of a simple data stream in terms of space requirements, namely $O(1/\epsilon^2 \cdot \log(\log c) \cdot \log(1/\delta))$; an estimator $\hat{c}$ is said to be an $(\epsilon, \delta)$-approximation of $c$, for some $\epsilon, \delta > 0$ arbitrarily small, if $P(|\hat{c} - c| > \epsilon c) < \delta$. Indyk and Woodruff (2003) show that the dependence of the space requirement on $\epsilon$ through the factor $1/\epsilon^2$ cannot be reduced to $1/\epsilon$. For a general data stream, the $(\epsilon, \delta)$-approximation of Cormode et al. (2003) requires a data sketch of length $O(1/\epsilon^2 \cdot \log(1/\delta))$; this result is obtained from upper bounds on tail probabilities of the estimator $\hat{c}$. We employ the same approach in Section 3.3 to derive storage requirements for our algorithms.



3 Order statistics 

3.1 Continuous random variables 

A data stream in the cash register case provides data elements of the form $(i_t, d_t)$, where $i_t \in \mathcal{I}_T$ and $d_t > 0$, for $t = 1, \ldots, T$. We start with a simple adaptation of the ideas of Flajolet and Martin (1985) and Giroire (2009), which we call the maximal-term data sketch. At time $t$, the data type $i_t$ is used as the seed of a random number generator to produce the first pseudo-random number $h(i_t)$ uniformly distributed on (0, 1). Write $h(i_t) \sim$ U(0, 1). The algorithm records $h^+$, the maximum value of $h(i_t)$, as the stream is processed, restarting the random number generator with the seed $i_t$ at each stage. Note that if a particular data type is seen more than once, the value of $h^+$ is unchanged, but whenever a new type $i_t$ is observed, there is a chance that $h^+$ will increase.

For the idealised U(0, 1) hash function, the variable $Y = h^+$ has density $f(y; c) = c y^{c-1}$, $y \in (0, 1)$, since it is the maximum of $c$ independent U(0, 1) variables, where $c$ is the unknown cardinality. The quantity $c$ is then an unknown parameter to be estimated by standard statistical methods. To increase the efficiency in estimating $c$, we sample $m$ successive values $h_1(i_t), \ldots, h_m(i_t)$ from the random number generator at each stage, and store $Y_j = h_j^+$, $j = 1, \ldots, m$, thus obtaining a sample of size $m$ from $f(y; c)$.

Proposition 1. The MLE of $c$ based on $(Y_1, \ldots, Y_m)$ is $\hat{c} = -m / \sum_{j=1}^{m} \log Y_j$ with asymptotic distribution Normal$(c, c^2/m)$ as $m \to \infty$. The expression

$$ -c \sum_{j=1}^{m} \log(Y_j) \sim \mathrm{Gamma}(m, 1), \tag{1} $$

can be used as a pivot in setting exact confidence intervals for $c$.

Proof. Using standard sampling theory. □

Asymptotically, $\hat{c}$ is unbiased and approximately normally distributed with standard error $c/\sqrt{m}$, so that by storing $m = 10{,}000$ values, for example, we can obtain an estimate of $c$ to within 2% with 95% confidence, regardless of the size of $c$.
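The maximal-term sketch and the estimator of Proposition 1 can be sketched as follows. This is a minimal illustration, assuming Python's `random.Random` seeded with the label as the idealised U(0, 1) hash; the function names and the toy stream are hypothetical, and the interval uses the normal approximation with $\hat{c}$ plugged into the standard error rather than the exact Gamma pivot.

```python
import math
import random

def update_sketch(sketch, label):
    """Fold one stream element into the running maxima Y_1, ..., Y_m."""
    rng = random.Random(str(label))      # restart the generator from the seed i_t
    for j in range(len(sketch)):
        u = rng.random()                 # h_j(i_t), treated as U(0, 1)
        if u > sketch[j]:
            sketch[j] = u

def estimate_cardinality(sketch):
    """MLE of Proposition 1: c_hat = -m / sum_j log Y_j."""
    return -len(sketch) / sum(math.log(y) for y in sketch)

def normal_interval(sketch, z=1.96):
    """Approximate interval from Normal(c, c^2/m), using c_hat in the standard error."""
    c_hat = estimate_cardinality(sketch)
    half_width = z * c_hat / math.sqrt(len(sketch))
    return c_hat - half_width, c_hat + half_width

m = 256
labels = [f"item-{i % 2000}" for i in range(4000)]   # 2000 distinct types, each seen twice
sketch = [0.0] * m
for label in labels:
    update_sketch(sketch, label)

c_hat = estimate_cardinality(sketch)
low, high = normal_interval(sketch)
print(round(c_hat), round(low), round(high))
```

With $m = 256$ the relative standard error is about $1/\sqrt{256} \approx 6\%$, so the printed estimate should typically land within a few hundred of the true value of 2000.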

Remark: When estimating an integer-valued parameter, such as the cardinality, the derivatives involved in the standard derivation of the large sample distribution of the MLE cannot be calculated. Nevertheless, equivalent results can be derived in terms of finite differences, and since the standard deviation of the estimators we consider is of the order of $c$, with $c$ large, the use of derivatives can be justified. For an early discussion of these issues, see Hammersley (1950).

Note that the maximal-term sketch does not allow deletions in the stream, i.e., $d_t < 0$, since it does not take into account the value of $d_t$, and thus cannot modify the quantities $Y_j$ if $a_{i_t}$ becomes zero. In contrast, the method of data sketching via random projections in Section 4 allows deletions and permits the estimation of $\sum_{i \in \mathcal{I}} \mathbb{1}(a_i(T) > 0)$, provided that $a_i(T) \ge 0$ whenever the estimation procedure is applied.



3.1.1 Using the kth order statistic 

A possible improvement might be to store the $k$th order statistic of the hash values rather than the maximum $h^+$.

Proposition 2. For $k < c$, let $Y_j$ denote the $k$th order statistic of the hash values from the $j$th hash function $h_j \sim$ U(0, 1), $j = 1, \ldots, m$. The MLE $\hat{c}$ of $c$ based on $Y_1 = y_1, \ldots, Y_m = y_m$ is the unique root of

$$ \log\Big( \prod_{j=1}^{m} y_j \Big) + m \sum_{i=1}^{k} \frac{1}{c - i + 1} = 0. \tag{2} $$

When $c$ is large, the root is given approximately by

$$ \hat{c} \approx \frac{k}{1 - \big( \prod_{j=1}^{m} y_j \big)^{1/m}}, $$

with standard error approximately $c/\sqrt{km}$. Furthermore, the estimator in (2) is recursively computable.

Proof. The first part of the proof is straightforward. For the second, recall that a sequence of statistics $T_m(x_1, \ldots, x_m)$ is said to be recursively computable if

$$ T_m(x_1, \ldots, x_m) = T_m(z_1, \ldots, z_m) \;\Rightarrow\; T_{m+1}(x_1, \ldots, x_m, w) = T_{m+1}(z_1, \ldots, z_m, w), \quad \forall m \in \mathbb{N}; $$

see for example Lauritzen (1988), who proves, for independent random variables $X_1, \ldots, X_m$, that if $T_m(X_1, \ldots, X_m)$ is minimal sufficient, then the sequence $T_m$, $m \ge 1$, is recursively computable. This property of sufficient statistics was first remarked by Fisher (1925). It follows from a theorem of Lehmann and Scheffé (1950) that the statistic $T_m(Y_1, \ldots, Y_m) = \prod_{j=1}^{m} Y_j$ is minimal sufficient for $c$, so $\hat{c}$ is also minimal sufficient and hence recursively computable. □






The property of recursive computability is particularly important when dealing with massive data sets due to constraints on available storage. For example, suppose two independent estimates, $\hat{c}_1$ and $\hat{c}_2$, of the cardinality $c$ are available, based on samples of size $m_1$ and $m_2$. By substituting the estimates in (2), the associated product terms can be recovered; the combined estimate can then be obtained by combining the products and using (2) once again with $m = m_1 + m_2$. When $c$ is large, the combined estimate is approximated by

$$ \frac{k}{1 - \big[ (1 - k/\hat{c}_1)^{m_1} (1 - k/\hat{c}_2)^{m_2} \big]^{1/(m_1 + m_2)}}. $$

Furthermore, we remark that to keep a record of the $k$th order statistic for each of the $m$ subsets as the stream is processed requires storing $km$ values. However, since the standard error of $\hat{c}$ is approximately $c/\sqrt{km}$ for large $m$, there is no gain in accuracy relative to the storage requirement.
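The merging step can be checked numerically. The sketch below uses the large-$c$ approximation $k / (1 - (\prod_j y_j)^{1/m})$ and the combination rule stated above; the order-statistic values are made up purely for illustration, and the names `approx_estimate` and `combine` are hypothetical.

```python
import math

def approx_estimate(ys, k):
    """Large-c approximation to the MLE: k / (1 - (prod_j y_j)^(1/m))."""
    m = len(ys)
    log_prod = sum(math.log(y) for y in ys)
    return k / (1.0 - math.exp(log_prod / m))

def combine(c1, m1, c2, m2, k):
    """Merge two estimates by recovering their product terms and pooling them."""
    log_prod = m1 * math.log(1.0 - k / c1) + m2 * math.log(1.0 - k / c2)
    return k / (1.0 - math.exp(log_prod / (m1 + m2)))

k = 5
ys_a = [0.991, 0.987, 0.995]    # hypothetical kth order statistics, first sketch
ys_b = [0.993, 0.989]           # hypothetical kth order statistics, second sketch

c_a = approx_estimate(ys_a, k)
c_b = approx_estimate(ys_b, k)
merged = combine(c_a, len(ys_a), c_b, len(ys_b), k)
pooled = approx_estimate(ys_a + ys_b, k)
print(merged, pooled)
```

The merged value agrees with the estimate computed from the pooled observations up to rounding, so only the pairs $(\hat{c}_1, m_1)$ and $(\hat{c}_2, m_2)$ need to be retained.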

We also note that there is no advantage in using a hash function $h$ that maps to a continuous distribution $F$ other than U(0, 1). The MLE of the maximal-term data sketch merely becomes

$$ \hat{c} = \frac{-m}{\sum_{j=1}^{m} \log F(M_j)}, \quad \text{where } M_j = \max_{i \in \mathcal{I}_T} h_j(i), \tag{3} $$

which has the same distribution as $\hat{c}$ in Proposition 1.
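The reason the choice of continuous $F$ is immaterial is the probability integral transform. A short worked step, assuming the hash values $h_j(i)$ are independent draws from a continuous, strictly increasing $F$:

```latex
\begin{align*}
P\big(F(M_j) \le y\big)
  &= P\big(h_j(i) \le F^{-1}(y) \ \text{for all } i \in \mathcal{I}_T\big)
   = \big[F(F^{-1}(y))\big]^{c} = y^{c}, \qquad y \in (0,1),
\end{align*}
```

so $F(M_j)$ has density $c\,y^{c-1}$ on $(0, 1)$, exactly the density of the maximum of $c$ independent U(0, 1) variables, and the uniform-case estimator applies to the transformed values $F(M_j)$.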



3.2 Discrete random variables 

Hashing to integer values rather than floating point numbers requires less storage, a priority when handling massive data streams. We show that the loss of statistical efficiency is negligible when integer-valued hash functions are chosen appropriately. We first consider hashing to Bernoulli random variables, not previously considered in the literature, and then to geometric random variables.

3.2.1 Bernoulli random variables 

To implement hashing to a Bernoulli variable, we start with an array of 0s of length $m$ and then change the $j$th element to 1 if $h_j(i_t) < p$, where, as before, $h_j(i_t)$ is the $j$th simulated U(0, 1) variable from the seed $i_t$, $j = 1, \ldots, m$. The value of $p$ is chosen to maximise Fisher's information.

Proposition 3. Fisher's information for Bernoulli hash functions with probability $p$ is maximised with $p_{\max} = 1 - \exp(-\lambda_0/c) \approx \lambda_0/c$, for large $c$, where $\lambda_0 = 2 + W(-2e^{-2}) \approx 1.594$, and $W$ is Lambert's function. The asymptotic relative efficiency of the MLE of $c$ with Bernoulli hashing ($p = \lambda/c$), relative to the estimator obtained with a continuous hash function, is $\lambda^2/(e^{\lambda} - 1)$ for large $c$.

This result enables lower bounds on the asymptotic relative efficiency (ARE) to be specified. For example, if $c$ is known in advance to lie in $(0.3c_0, 4.3c_0)$ for some fixed $c_0$, then with $p = 1/c_0$ the ARE is at least 25%. Consequently, $4m$ bits of storage suffice to provide the same accuracy as storing $m$ floating point numbers when hashing to continuous random variables.

Proof. After processing the data stream we have observations from $m$ Bernoulli variables, each with probability $P = 1 - (1 - p)^c$. Fisher's information for $P$ is $m/(P(1 - P))$ and hence the information for $c$ is

$$ I(c) = \frac{m}{P(1 - P)} \left( \frac{dP}{dc} \right)^2 = \frac{m q^c (\log q)^2}{1 - q^c}, $$

where $q = 1 - p$. Substituting $q = \exp(-\lambda/c)$ gives $I(c) = m c^{-2} \lambda^2/(e^{\lambda} - 1)$. Since Fisher's information using continuous variables is $m/c^2$, this gives the asymptotic relative efficiency as claimed. The Fisher information from Bernoulli hashing attains its maximum when $\lambda$ is the positive root of $\lambda = 2(1 - \exp(-\lambda))$, which can be expressed in terms of Lambert's $W$ function and is given approximately by $\lambda_0 = 1.594$. □
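The quantities in Proposition 3 are easy to check numerically. The sketch below finds $\lambda_0$ by fixed-point iteration on $\lambda = 2(1 - e^{-\lambda})$, evaluates the ARE $\lambda^2/(e^{\lambda} - 1)$, and runs a small Bernoulli sketch whose estimate inverts $P = 1 - (1 - p)^c$ at the observed fraction of 1s. The seeding scheme, function names and toy stream are assumptions made for illustration.

```python
import math
import random

def are_bernoulli(lam):
    """ARE of Bernoulli hashing with p = lam / c: lam^2 / (e^lam - 1)."""
    return lam * lam / math.expm1(lam)

def optimal_lambda(tol=1e-12):
    """Positive root of lam = 2 (1 - exp(-lam)) by fixed-point iteration."""
    lam = 1.5
    while True:
        nxt = 2.0 * (1.0 - math.exp(-lam))
        if abs(nxt - lam) < tol:
            return nxt
        lam = nxt

def bernoulli_sketch(labels, m, p):
    """Array of m bits; bit j is set if any label has h_j(label) < p."""
    bits = [0] * m
    for label in labels:
        rng = random.Random(str(label))   # assumed seeding scheme
        for j in range(m):
            if rng.random() < p:
                bits[j] = 1
    return bits

def bernoulli_mle(bits, p):
    """Solve r/m = 1 - (1 - p)^c for c, where r is the number of 1s."""
    m, r = len(bits), sum(bits)
    if r == m:
        raise ValueError("sketch saturated: p is too large for this cardinality")
    return math.log(1.0 - r / m) / math.log(1.0 - p)

lam0 = optimal_lambda()
c_true, m = 1000, 1000
p = lam0 / c_true                         # assumes a rough prior guess c_0 = 1000
bits = bernoulli_sketch([f"item-{i}" for i in range(c_true)], m, p)
c_hat = bernoulli_mle(bits, p)
print(round(lam0, 3), round(are_bernoulli(lam0), 3), round(c_hat))
```

In practice $c$ is unknown, so $p$ has to be set from a prior guess $c_0$; the ARE curve shows how much efficiency is lost when the guess is off.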



3.2.2 Geometric random variables 

Suppose that the hash function maps to a geometric random variable with cumulative distribution function $G_p(x) = 1 - q^x$, with $p + q = 1$, $x = 1, 2, \ldots$ We note that $p = 1/2$ is the case analysed by Flajolet (2004). As before, for the maximal-term data sketch, we store $Y_j = h_j^+ = \max\{h_j(i_t); i_t \in \mathcal{I}_T\}$, $j = 1, \ldots, m$, where $h_j(i_t)$ are independently simulated from $G_p$ by the method of seeding, and estimate $c$ based on the random sample $Y_1 = y_1, \ldots, Y_m = y_m$. Let $G_p^c$ be the distribution function of the maximum of $c$ independent $G_p$ variables.

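Proposition 4. The MLE of $c$ based on a sample $Y_1 = y_1, \ldots, Y_m = y_m$ drawn from $G_p^c$ satisfies

$$ \sum_{j=1}^{m} \frac{\log(1 - q^{y_j}) (1 - q^{y_j})^c - \log(1 - q^{y_j - 1}) (1 - q^{y_j - 1})^c}{(1 - q^{y_j})^c - (1 - q^{y_j - 1})^c} = 0. \tag{4} $$

In the limit as $m \to \infty$, the distribution of $\hat{c}/c$ is asymptotically normal with mean 1 and variance $1/(m \psi_c)$, where $\psi_c$ can be approximated by

$$ \psi_\infty = \sum_{k=-\infty}^{\infty} \frac{\big[ q^{k-1} \exp(q^{k}) - q^{k} \exp(q^{k-1}) \big]^2 \exp(-q^{k-1} - q^{k})}{\exp(q^{k-1}) - \exp(q^{k})} $$

for large $c$.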

Proof. The log-likelihood function is

$$ L(y_1, \ldots, y_m; c) = \sum_{j=1}^{m} \log\big\{ (1 - q^{y_j})^c - (1 - q^{y_j - 1})^c \big\}. $$

Formally differentiating with respect to $c$, we have the score function as given in (4). Squaring and taking expectations in the case $m = 1$, we have Fisher's information per observation:

$$ I(c) = \sum_{y=1}^{\infty} \frac{\big[ \log(1 - q^{y}) (1 - q^{y})^c - \log(1 - q^{y-1}) (1 - q^{y-1})^c \big]^2}{(1 - q^{y})^c - (1 - q^{y-1})^c}. $$

As $m \to \infty$, from the usual large sample theory of maximum likelihood estimation, $\hat{c}/c$ is asymptotically normally distributed with mean 1 and variance $1/(m \psi_c)$, where $\psi_c = c^2 I(c)$. Now let $c \to \infty$ through the sequence $c = q^{-r}$, where $r$ is a positive integer. Writing $y = r + k$, so that $(1 - q^{y})^c \to \exp(-q^{k})$ and $c \log(1 - q^{y}) \to -q^{k}$, we have

$$ \lim_{c \to \infty} c^2 I(c) = \sum_{k=-\infty}^{\infty} \frac{\big[ q^{k-1} \exp(q^{k}) - q^{k} \exp(q^{k-1}) \big]^2 \exp(-q^{k-1} - q^{k})}{\exp(q^{k-1}) - \exp(q^{k})}, $$

as claimed. □

In practice, to solve for $c$ in (4), one iteration of the Newton-Raphson algorithm started from a consistent estimator of $c$ produces an asymptotically efficient estimator (Rao, 1973). A consistent estimator is $\tilde{c} = \log(r/m) / \log(1 - q^{n})$, where $r = |\{y_j; y_j \le n\}|$ and $n = \lfloor \log_q(1/2) \rfloor$, if $r \ne 0$; else, set $\tilde{c} = T$, the length of the stream observed.

The statistical efficiency of the maximal-term MLE in the geometric case can be made arbitrarily close to that in the continuous case. For large $c$, the Fisher information is an increasing function of $q$ as $q \to 1$. In particular, for $q = 10/11$, the ARE of the estimator of $c$ based on a sample of maxima from $G_p$ as compared to the estimator based on a random sample of maxima from any continuous distribution is 0.9985. For the special case considered by Flajolet (2004) with $p = 1/2$, the asymptotic relative efficiency is 0.9304.
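The two efficiency figures can be reproduced by summing the limiting series numerically. The sketch below evaluates the limit of $c^2 I(c)$ term by term in the algebraically equivalent form $[b e^{-b} - a e^{-a}]^2 / (e^{-a} - e^{-b})$ with $a = q^{k}$ and $b = q^{k-1}$, which follows from the limits $(1 - q^{y})^c \to \exp(-q^{k})$ and $c \log(1 - q^{y}) \to -q^{k}$ along $c = q^{-r}$, $y = r + k$. The truncation range `span` and the function name are assumptions.

```python
import math

def psi_infinity(q, span=40.0):
    """Numerically sum the limiting series for c^2 I(c) in the geometric case.

    Terms are evaluated for q^k between exp(-span) and exp(span); outside
    that range they are negligible.
    """
    step = -math.log(q)
    k_max = int(math.ceil(span / step))
    total = 0.0
    for k in range(-k_max, k_max + 1):
        a = q ** k                      # q^k
        b = a / q                       # q^(k-1), always larger than a
        num = (b * math.exp(-b) - a * math.exp(-a)) ** 2
        # exp(-a) - exp(-b), written to avoid cancellation when a and b are tiny
        den = math.exp(-a) * (-math.expm1(-(b - a)))
        if den > 0.0:                   # skip terms that underflow to 0/0
            total += num / den
    return total

psi_half = psi_infinity(0.5)            # p = 1/2
psi_ten_elevenths = psi_infinity(10 / 11)
print(round(psi_half, 4), round(psi_ten_elevenths, 4))
```

The values approach 1 as $q \to 1$, which is the sense in which geometric hashing loses almost no efficiency relative to continuous hashing.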

We note that the estimator $\hat{c}$, based on a sample of maxima from $G_p$, does not have the property of recursive computability, unlike the estimator in the continuous case. Nevertheless, when $q$ approaches 1, the geometric distribution is well approximated by the exponential distribution with parameter $\lambda = -\log q$, so the log-likelihood is approximately

$$ L(y_1, \ldots, y_m; c) = m \log(c\lambda) + (c - 1) \sum_{j=1}^{m} \log\big(1 - q^{y_j}\big) - \lambda \sum_{j=1}^{m} y_j. $$

For this distribution, the statistic $S_m = \prod_{j=1}^{m} (1 - e^{-\lambda y_j}) = \prod_{j=1}^{m} (1 - q^{y_j})$ is sufficient for the parameter $c$, and the MLE is $\hat{c} = -m/\log S_m$, so that, to this degree of approximation, recursive estimation is possible.
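A sketch of geometric hashing with the approximate estimator $\hat{c} = -m/\log S_m$ follows. It assumes the same label-seeded generator as a stand-in for the hash functions and maps each uniform value to a geometric one by inversion; the function names and toy stream are hypothetical. Because the estimator treats integer maxima as if they were exponential, some upward bias is to be expected (a rough calculation suggests a few percent at $q = 10/11$).

```python
import math
import random

def geometric_hash(u, q):
    """Map u in (0, 1] to x in {1, 2, ...} so that P(X <= x) = 1 - q**x."""
    return max(1, math.ceil(math.log(u) / math.log(q)))

def geometric_sketch(labels, m, q):
    """Integer maxima Y_1, ..., Y_m of geometric hash values over the stream."""
    maxima = [0] * m
    for label in labels:
        rng = random.Random(str(label))            # assumed seeding scheme
        for j in range(m):
            x = geometric_hash(1.0 - rng.random(), q)
            if x > maxima[j]:
                maxima[j] = x
    return maxima

def approx_mle(maxima, q):
    """Exponential-approximation MLE: c_hat = -m / log S_m, S_m = prod_j (1 - q^{y_j})."""
    log_s = sum(math.log1p(-(q ** y)) for y in maxima)
    return -len(maxima) / log_s

q, m, c_true = 10 / 11, 256, 2000
labels = [f"item-{i % c_true}" for i in range(2 * c_true)]   # every type seen twice
maxima = geometric_sketch(labels, m, q)
c_hat = approx_mle(maxima, q)
print(max(maxima), round(c_hat))
```

Each stored maximum is a small integer of order $\log c / \log(1/q)$, which is where the storage saving over floating point maxima comes from.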

3.3 Storage requirements

In this section we determine exponentially decreasing upper bounds on the tail probabilities of our estimators, and show that in the geometric case, the storage requirement of an algorithm implementing the estimation procedure attains the tight lower bound of Indyk and Woodruff (2003).



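Proposition 5. In the continuous case, the tail error bounds for the estimator $\hat{c}$ given in (3) are

$$ P(\hat{c} > (1 + \epsilon)c) \le \exp(-m\epsilon^2/C_1), \quad \text{and} \quad P(\hat{c} < (1 - \epsilon)c) \le \exp(-m\epsilon^2/C_2), $$

where

$$ C_1 = \frac{\epsilon^2 (1 + \epsilon)}{-\epsilon + (1 + \epsilon)\log(1 + \epsilon)}, \qquad C_2 = \frac{\epsilon^2 (1 - \epsilon)}{\epsilon + (1 - \epsilon)\log(1 - \epsilon)}. $$

In the limit as $\epsilon \to 0$, the constants $C_1$ and $C_2$ tend to 2, so for small $\epsilon$, the tail error bounds are exponentially decreasing in $m\epsilon^2$.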

Proof. In the continuous case, the pivotal quantity $mc/\hat{c}$ has a Gamma distribution with moment generating function $(1 - t)^{-m}$, $t < 1$. The bounds on the tail probabilities can then be obtained from the moment generating function using the method of Chernoff (1952). □
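The Chernoff bounds for the Gamma pivot can be evaluated directly. The sketch below uses the constants that result from optimising the moment generating function bound, $C_1 = \epsilon^2(1+\epsilon)/(-\epsilon + (1+\epsilon)\log(1+\epsilon))$ and $C_2 = \epsilon^2(1-\epsilon)/(\epsilon + (1-\epsilon)\log(1-\epsilon))$, and picks a sketch size by splitting $\delta$ equally between the two tails. The equal split and the function names are assumptions made for illustration.

```python
import math

def chernoff_constants(eps):
    """Constants C1 (upper tail) and C2 (lower tail) for relative error eps in (0, 1)."""
    c1 = eps ** 2 * (1 + eps) / (-eps + (1 + eps) * math.log1p(eps))
    c2 = eps ** 2 * (1 - eps) / (eps + (1 - eps) * math.log1p(-eps))
    return c1, c2

def tail_bounds(m, eps):
    """Bounds on P(c_hat > (1+eps)c) and P(c_hat < (1-eps)c)."""
    c1, c2 = chernoff_constants(eps)
    return math.exp(-m * eps ** 2 / c1), math.exp(-m * eps ** 2 / c2)

def sketch_size(eps, delta):
    """Smallest m for which each tail bound is at most delta / 2."""
    c1, c2 = chernoff_constants(eps)
    return math.ceil(max(c1, c2) * math.log(2.0 / delta) / eps ** 2)

m_needed = sketch_size(eps=0.1, delta=0.05)
upper, lower = tail_bounds(m_needed, 0.1)
print(m_needed, upper + lower)
```

For small $\epsilon$ both constants are close to 2, so the required sketch size grows like $2\log(2/\delta)/\epsilon^2$ under this split.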



In the discrete geometric case, these results hold to arbitrary accuracy by approximating the geometric distribution by an exponential distribution with rate $-\log q$ and $q$ close to 1. From Proposition 5, $\hat{c}$ is an $(\epsilon, \delta)$-approximation of $c$ provided that $m = O(\epsilon^{-2})$. The expected value of the maximum order statistic based on a sample of size $c$ from $G_p$ is $O(\log c)$ for fixed $p$ (Kirschenhofer and Prodinger, 1993). It follows that the space requirement of an algorithm implementing the estimation procedure in the geometric case is of order $O(\epsilon^{-2} \log(\log c))$, attaining the tight lower bound of Indyk and Woodruff (2003).



4 Random projections 



Data sketching via random projections (Cormode et al., 2003; Indyk, 2006) exploits properties of the $\alpha$-stable distribution (Lévy, 1924). The stability property lies at the heart of the random projection method. For simplicity, we restrict attention to positive strictly stable variables of index $\alpha$, for $\alpha \in (0, 1)$, having Laplace transform $e^{-\lambda^{\alpha}}$, $\lambda > 0$ (Feller, 1971; Zolotarev, 1986). Let $F_\alpha$ denote the distribution function. The stability property of $F_\alpha$ is as follows: if $X_1, X_2 \sim F_\alpha$ independently, and $a_1$ and $a_2$ are arbitrary positive constants, then

$$ a_1 X_1 + a_2 X_2 \overset{d}{=} (a_1^{\alpha} + a_2^{\alpha})^{1/\alpha} X, \tag{5} $$

where $X \sim F_\alpha$. See, for example, Feller (1971).

The random projection method for cardinality estimation proceeds as follows (Cormode et al., 2003; Indyk, 2006). For $j = 1, \ldots, m$ and $\alpha \in (0, 1)$ fixed, let $h_j$ be independent hash functions mapping from $\mathcal{I}$ to samples from $F_\alpha$, via the usual method of seeding; in practice, this will involve constructing simulated $F_\alpha$ variables from pairs of U(0, 1) variables. Then update and store the projections $V_j(T) = \sum_{t=1}^{T} d_t h_j(i_t)$, $j = 1, \ldots, m$, to give the data sketch $V_1, \ldots, V_m$, where we write $V_j$ for $V_j(T)$ for brevity. By the stability property in (5), we have that

$$ V_j = \sum_{t=1}^{T} d_t h_j(i_t) = \sum_{i \in \mathcal{I}_T} a_i h_j(i) \overset{d}{=} \ell_\alpha(a) X_j, \tag{6} $$

where $X_j \sim F_\alpha$ independently for $j = 1, \ldots, m$, and $\ell_\alpha(a) = \big( \sum_{i \in \mathcal{I}_T} a_i^{\alpha} \big)^{1/\alpha}$. In other words, $V_1, \ldots, V_m$ is a sample from a scale family with unknown scale parameter $\ell_\alpha(a)$. It should be noted that for a simple data stream, $\ell_\alpha(a) = \big( \sum_{i \in \mathcal{I}_T} n_i^{\alpha} \big)^{1/\alpha}$, where $n_i$ is the number of times that item $i$ is observed in the data stream by time $T$.

In principle, calculation of the MLE of the scale parameter $\ell_\alpha(a)$ in (6) is straightforward. Raising this MLE to the power of $\alpha$ gives the MLE of $\sum_{i \in \mathcal{I}_T} a_i^{\alpha}$; with $\alpha$ sufficiently small this produces an approximation to $c$. In practice, there are severe numerical difficulties in obtaining the MLE when $\alpha$ is small; see, for example, Nolan (1997, 2001).

Instead, Cormode et al. (2003) estimate $\ell_\alpha(a)$ by $\tilde{V}/\tilde{\mu}$, where $\tilde{V}$ is the sample median and $\tilde{\mu}$ is the numerically determined median of $F_\alpha$. They show that an $(\epsilon, \delta)$-approximation to $c$ can be obtained by choosing $m$ of order $O(1/\epsilon^2 \cdot \log(1/\delta))$ and $0 < \alpha \le \epsilon/\log(B)$, where $B$ is an upper bound for the elements of $a$.

We adopt a slightly different approach and exploit the limiting distribution of $V_j^{\alpha}$ for small $\alpha$.

Proposition 6. As $\alpha \to 0$ the random variable

$$ c \sum_{j=1}^{m} V_j^{-\alpha} \overset{d}{\to} \mathrm{Gamma}(m, 1). \tag{7} $$

Consequently, the variable can be used as an approximate pivot in setting confidence intervals for $c$. For $\alpha$ small the estimator $\hat{c} = m / \sum_{j=1}^{m} V_j^{-\alpha}$ has asymptotic distribution Normal$(c, c^2/m)$ as $m \to \infty$.



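Proof. Zolotarev (1986) shows that $X^{\alpha} \overset{d}{\to} 1/Z$ where $Z \sim \mathrm{Exp}(1)$, as $\alpha \to 0$. It follows from (6) that

$$ V_j^{\alpha} = \Big( \sum_{i \in \mathcal{I}_T} a_i h_j(i) \Big)^{\alpha} \overset{d}{=} X_j^{\alpha} \sum_{i \in \mathcal{I}_T} a_i^{\alpha} \overset{d}{\to} c/Z, \quad j = 1, \ldots, m \text{ (independently)}, $$

and hence $c \sum_{j=1}^{m} V_j^{-\alpha} \overset{d}{\to} \mathrm{Gamma}(m, 1)$. The estimator $\hat{c}$ is obtained by equating the pivot to its mean $m$ and the approximate distribution of $\hat{c}$ then follows from the asymptotic normality of the Gamma distribution. □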
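The projection sketch and the estimator $\hat{c} = m / \sum_j V_j^{-\alpha}$ can be illustrated as follows. This is a sketch under several assumptions: positive stable variates are simulated with Kanter's representation, $X = (A(U)/E)^{(1-\alpha)/\alpha}$ with $U$ uniform on $(0, \pi)$ and $E \sim \mathrm{Exp}(1)$, one standard way of building such variates from a pair of uniforms; the label-seeded generator stands in for the hash functions; $\alpha = 0.05$ is an arbitrary small value; and all names are hypothetical. For fixed $\alpha > 0$ the estimate is only approximately unbiased.

```python
import math
import random

def positive_stable(rng, alpha):
    """Positive alpha-stable variate with Laplace transform exp(-s**alpha), 0 < alpha < 1."""
    u = math.pi * (1.0 - rng.random())            # uniform on (0, pi]
    e = rng.expovariate(1.0)                      # Exp(1)
    a_u = (math.sin(alpha * u) ** alpha
           * math.sin((1.0 - alpha) * u) ** (1.0 - alpha)
           / math.sin(u)) ** (1.0 / (1.0 - alpha))
    return (a_u / e) ** ((1.0 - alpha) / alpha)

def update_projections(v, label, d, alpha):
    """Add d * h_j(label) to each projection V_j; the label seeds the generator."""
    rng = random.Random(str(label))
    for j in range(len(v)):
        v[j] += d * positive_stable(rng, alpha)

def projection_estimate(v, alpha):
    """c_hat = m / sum_j V_j^(-alpha)."""
    return len(v) / sum(x ** (-alpha) for x in v)

alpha, m, c_true = 0.05, 300, 300
v = [0.0] * m
for i in range(c_true):
    update_projections(v, f"item-{i}", 1 + i % 3, alpha)   # quantities d_t in {1, 2, 3}
c_hat = projection_estimate(v, alpha)
print(round(c_hat))
```

The stored values $V_j$ are extremely heavy-tailed for small $\alpha$, so a practical implementation would need care with floating point range and cancellation; this example ignores those issues.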

When comparing the estimator $\hat{c}$ above with $\tilde{c} = (\tilde{V}/\tilde{\mu})^{\alpha}$ in Cormode et al. (2003), we are effectively comparing the MLE of the parameter of an exponential distribution with an estimator obtained by equating the sample and population median. The ARE of $\tilde{c}$ to $\hat{c}$ is then approximately 48%, since by using the standard asymptotic distribution of sample medians, we find that $\tilde{c} \sim \mathrm{Normal}(c, c^2 (\log 2)^{-2}/m)$ for large $m$, i.e., $\hat{c}$ is twice as efficient asymptotically as $\tilde{c}$.
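The 48% figure can be recovered with a short delta-method calculation. The steps below assume the limiting model in which $W_j = V_j^{-\alpha}$ is exponential with rate $c$, and write $\tilde{W}$ for the sample median of the $W_j$ and $f_W$ for their density (notation introduced here for illustration):

```latex
\begin{align*}
\operatorname{median}(W) &= \frac{\log 2}{c}, \qquad
f_W\!\left(\tfrac{\log 2}{c}\right) = \frac{c}{2}, \qquad
\operatorname{Var}(\tilde{W}) \approx \frac{1}{4 m f_W^2} = \frac{1}{m c^2}, \\
\tilde{c} &= \frac{\log 2}{\tilde{W}}
\;\Longrightarrow\;
\operatorname{Var}(\tilde{c}) \approx \left(\frac{c^2}{\log 2}\right)^{2} \frac{1}{m c^2}
 = \frac{c^2}{(\log 2)^2\, m}, \\
\mathrm{ARE} &= \frac{c^2/m}{c^2/\{(\log 2)^2 m\}} = (\log 2)^2 \approx 0.48 .
\end{align*}
```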

At this stage we have shown that the estimators of c using the maximal-term or random pro- 
jection sketches can have comparable efficiency. This leads us to conjecture that in some sense the 
methods are essentially equivalent, which we explore in the next section. 

5 Comparison of projection and maximal-term sketches

In Section 3.1 we showed that the efficiency of the maximal-term data sketch does not depend on the particular continuous distribution that is simulated by the hash function. For the purpose of comparison, we now hash to $F_\alpha$ in both cases. Note that we are not proposing to use this distribution directly for the maximal-term estimator since it has an extremely heavy tail when $\alpha$ is small. Storing the maximum of $c$ such variables, for $c$ large, would require high precision floating point numbers.

Consider a data stream in the cash register case, observed up to time $T$. Let $a$ denote the accumulation vector, and $c$ the cardinality. For $j = 1, \ldots, m$, let $h_j$ be independent hash functions mapping from $\mathcal{I}$ to copies of $X \sim F_\alpha$, for fixed $\alpha \in (0, 1)$. Let $\hat{c}_P = m / \sum_{j=1}^{m} V_j^{-\alpha}$ be the projection estimator defined in Proposition 6 and let $\hat{c}_M$ denote the maximal-term estimator in (3), where $M_j = \max_{i \in \mathcal{I}_T} h_j(i)^{\alpha}$ and $F$ is the distribution function of $X^{\alpha}$.

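Theorem 1. For small $\alpha$, the pivotal quantities for the maximal-term and projection sketches are equivalent, i.e.,

$$ c \sum_{j=1}^{m} V_j^{-\alpha} + c \sum_{j=1}^{m} \log F(M_j) \overset{p}{\to} 0, \quad \text{as } \alpha \to 0, $$

and in particular $V_j^{-\alpha} + \log F(M_j) \overset{p}{\to} 0$ for each $j = 1, \ldots, m$.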

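Proof. Let $M = \max_{i \in \mathcal{I}_T} X_i^{\alpha}$ be a typical maximal term in (3) with $X_i \sim F_\alpha$, $i \in \mathcal{I}_T$, and let $\delta > 0$ be arbitrary. Since $P(M < y) \le P(X^{\alpha} < y)$ and $X^{-\alpha} \overset{d}{\to} \mathrm{Exp}(1)$ as $\alpha \to 0$ (Zolotarev, 1986), there are values $\alpha_0$ and $y_0 > 0$ such that $P(M < y_0) < \delta$ for all $\alpha < \alpha_0$.

Now let $G_\alpha(y)$ be the distribution function of $X^{\alpha}$, i.e., $G_\alpha = F$ in (3). Since $X^{-\alpha} \overset{d}{\to} \mathrm{Exp}(1)$, then $G_\alpha(y) \to \exp(-1/y)$, uniformly in $y > 0$, and consequently $\log G_\alpha(y) \to -1/y$ uniformly in $y > y_0$ as $\alpha \to 0$. It follows, by the usual arguments, that

$$ \log G_\alpha(M) + 1/M \overset{p}{\to} 0, \quad \text{as } \alpha \to 0. \tag{8} $$

Finally, writing $V = \sum_{i \in \mathcal{I}_T} a_i X_i$ for a typical term in (6), we have

$$ M^{1/\alpha} a_{\min} = X_{\max} a_{\min} \le V \le X_{\max} \sum_{i \in \mathcal{I}_T} a_i = M^{1/\alpha} \sum_{i \in \mathcal{I}_T} a_i, $$

where $X_{\max} = \max_{i \in \mathcal{I}_T} X_i$ and $a_{\min} = \min_{i \in \mathcal{I}_T} a_i$. It follows that

$$ a_{\min}^{\alpha} \le \frac{V^{\alpha}}{M} \le \Big( \sum_{i \in \mathcal{I}_T} a_i \Big)^{\alpha}, $$

and as $\alpha \to 0$, $V^{\alpha}/M \to 1$. Since both $M$ and $V^{\alpha}$ have proper limiting distributions, this implies that

$$ \frac{1}{V^{\alpha}} - \frac{1}{M} \overset{p}{\to} 0, \quad \text{as } \alpha \to 0, $$

and together with (8) we have

$$ \log F(M) + \frac{1}{V^{\alpha}} \overset{p}{\to} 0, \quad \text{as } \alpha \to 0. $$

We have established that the terms in the summations are individually equivalent for small $\alpha$, and since the number of terms, $m$, is finite, the result is proved. □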

Note that the specific values of $d_t > 0$ are unimportant in determining the cardinality. For practical purposes, positive values of $d_t$ can be taken to be 1 and this may have the effect of improving the bounds on $V^{\alpha}/M$ in the proof of Theorem 1.

6 Conclusion 

In this paper we discuss the problem of cardinality estimation over streaming data, under the assumption that the size of the data precludes the possibility of maintaining a comprehensive list of all distinct data elements observed. Probabilistic counting algorithms process data elements on the fly in three steps: (i) hash each data element to a copy of a pseudo-random variable, (ii) update a low-dimensional data sketch of the stream, and (iii) discard the data element. For this purpose, we present two approaches: indirect record keeping using pseudo-random variates and storing either selected order statistics, or random projections. The data sketch is a random sample of variables whose distribution is parameterised by the cardinality as unknown parameter, and we derive estimators of the cardinality in a conventional statistical framework.

We analyse the statistical properties of our estimators in terms of Fisher information, asymptotic relative efficiency, and error bounds on the estimation error, and the computational properties in terms of recursive computability and storage requirements. Finally, we demonstrate an unexpected link between the method of maximal-term sketching based on hashing to the $F_\alpha$ distribution, and the method of random projections, showing that the two methods are essentially the same when $\alpha$ is small. However, since there is no gain in efficiency for the maximal-term sketch in using the $F_\alpha$ distribution, rather than the simpler U(0, 1) distribution, as shown in Section 3.1, the latter is to be preferred. Moreover, since we show in Section 3.2 that discrete hash functions are capable of comparable efficiency but with reduced storage requirements, discrete maximal-term methods must be the method of choice. In fact, algorithms implementing our estimation procedure with discrete maximal-term sketching and geometric hashing attain the tight lower bound on storage requirements for cardinality estimation.